Abstract: An increasing number of businesses are migrating
their IT operations to the cloud. Likewise, there is an increased
emphasis on data analytics based on multiple datasets and sources 
to derive information not derivable when a dataset is mined 
in isolation. While ensuring security of data and computation 
outsourced to a third party cloud service provider is in itself 
challenging, supporting mash-ups and analytics of data from 
different parties hosted across different services is even more so. 
In this paper we propose a cloud-based service allowing multiple
parties to perform secure multi-party sum computation
using their clouds as delegates. Our scheme provides data privacy 
both from the delegates as well as from the other data owners 
under a lazy-and-curious adversary (semi-honest) model. We 
then describe how such a secure sum primitive may be used 
in various collaborative, cloud-based distributed data mining 
tasks (classification, association rule mining and clustering). We 
implement a prototype and benchmark the service, both as a 
stand-alone secure sum service, and as a building block for more 
complex analytics. The results suggest reasonable overhead and 
demonstrate the practicality of carrying out privacy preserved 
distributed analytics despite migrating (encrypted) data to pos- 
sibly different and untrusted (semi-honest) cloud services. 

I. Introduction 
A. Multi-Party Computing Service. 

An enormous amount of data is being generated every day
by a plethora of human activities and computing devices.
Traditionally, data is stored in a data owner's in-house
infrastructure, and access to outsiders is typically provided through
web services. Example services, including MedlinePlus [1],
Xignite [2], NOAA [3], ResMap [4] and Yahoo! Traffic [5],
offer a wide range of data: medical, financial market,
meteorological, satellite images, traffic, etc. Data from multiple
sources can be mashed up or jointly analyzed to create new
services and infer information that cannot be obtained from
a stand-alone dataset. For instance, satellite images and data
from weather sensors are used together to improve forecasts,
financial data from multiple institutions to make better market
predictions [6], traffic and human mobility data to aid urban
planning [7], and medical and mobility data to yield more insights
into the spread of diseases [8].

However, it is not always desirable or feasible to expose the
data itself, and yet being able to carry out computations or
analytics over it may provide benefits without violating
privacy. Multi-party computation [9], [10] is one
way to enable this, and has recently been successfully
deployed in applications such as blind auctions, which require
bidding information from parties to determine prices [11].

Even as multi-party computation protocols have matured
enough to be applied in practice, wide adoption requires
that they be offered as a basic service. Our work
is motivated by this observation, as well as by another
common recent trend, namely the move by many organizations
to cloud-based services in order to eliminate or downsize
in-house IT infrastructure.

Fig. 1: Multi-party computation (MPC) in traditional vs.
multi-cloud settings

B. Our Work. 

Recent developments in cloud computing have materialized
a concrete platform for rapid realization of the service-oriented
computing paradigm [12]. Cloud providers offer computing
as a service, from which software services can be built,
sold and integrated into complex applications. Companies
are leveraging the cloud for its cheap, elastic and scalable
resources. Migration of IT infrastructure is taking place, in
which most data, application logic and front-end services
are being moved to the cloud [15]. Many existing works
focus on what to migrate [13], [14], [15], assuming that the
cloud is trusted. Others investigate mechanisms for protecting
data privacy and for verifying computation correctness [16],
[17], [18], [19], [20]. The latter works consider single-party
settings, i.e., how one data owner can outsource its data in a
privacy-preserving manner and still carry out analytics on
it. Furthermore, these works mainly explore the theoretical
aspects of the problem.

Our work concerns the design space of multi-party com- 
putation outsourcing, which has so far not been explored to 
the best of our knowledge. In particular, we design a multi- 
party computation service which is invokable by authorized 
users when taking part in a multi-party protocol. The protocol 
can be run over untrusted (curious and lazy) clouds that act 
as delegates, and it guarantees individual data owners' data 
privacy (from the delegated clouds, as well as from other 
parties) and lets users verify if the computation has been 
carried out correctly. Specifically, we consider secure multi- 
party sum computation. 

Figure 1 illustrates the settings of our work, contrasting it
with the traditional multi-party computation setting (Figure
1a). Traditionally, each party houses its databases and computing
components for handling business logic internally. To take
part in a multi-party computation, they need to initiate network
connections with other parties. Not only is negotiating access
to internal networks a troublesome endeavor [15], but
the potential heterogeneity in network and computational
resources may also hinder the overall performance [21].



Our work focuses on the scenario in Figure 1b, in which parties
move their data to clouds (which is in any case happening for
many other reasons [12], [15]) and rely on the cloud-provided
service for the multi-party computation.

Specifically, we propose a protocol allowing the cloud 
delegates to evaluate the sums of parties' private inputs without 
learning either the individual values or the resulting sums. The 
parties are also able to check if the sum is correct. 

Not only does cloud-based deployment provide the basic
computation as a service, but more sophisticated analytics services
can also be realized using it as a building block. We demonstrate this
by designing some representative data mining tasks such as
Naive Bayes (classification), Apriori (association rule mining)
and K-Means (clustering). The resulting protocols support
encrypted databases, hence the cloud delegates cannot learn
the private data. Naturally, there are overheads in using the
service, as a trade-off for the security guarantees. However, this
overhead is amortized when the service is used within complex
applications.

Contributions. Our contributions are as follows: 

1) We present a multi-party computation (secure sum) 
service which can be run on curious-and-lazy cloud 
delegates while maintaining data privacy and correct- 
ness. The service can work with encrypted databases, 
and it ensures computation correctness with a minimal 
coordination among the parties outsourcing the task. 

2) We demonstrate how this service can be used in complex 
cloud-based data mining jobs. 

3) We benchmark the secure sum service as a stand-alone
application in a multi-cloud-like environment. We also
experiment with various data mining applications using
our service. Compared to the traditional multi-party
computation implementations without delegates (we refer
to them as the non-delegated versions), there are naturally
overheads due to the added security mechanisms.
However, the experiments show that such overheads
are within a reasonable range for applying our approach
in practice, and the cost amortization becomes more
prominent with increased analytics workloads.

Organization. The rest of this paper is structured as follows.
Section II details the delegated secure sum protocol
in an abstract manner. Section III shows how complex data
mining tasks can be built with this service. We consider three
classic data mining tasks: classification, association rule
mining and clustering. Section IV follows with experimental
benchmarks of the protocol and the data mining tasks.
Related work is discussed in Section V before we draw
our conclusions and outline future directions for this work
in Section VI.

II. Delegated Secure Sum 

We briefly explain two traditional approaches for multi-party
secure sum computation, on one of which our
service is built. We suppose a number of parties $P_0, P_1, \ldots$
with private inputs $x_0, x_1, \ldots$ wish to compute $s = \sum_i x_i$ in
a privacy-preserving manner, so that $P_i$ will not learn $x_j$ for
$j \neq i$.
In the ring-based approach [22], the parties form a ring and
messages are forwarded in a pre-defined direction. A master
party $P_m$ is elected and starts the computation by sending
$v = x_m + r$ (for a random $r$) to its immediate clockwise (or
counter-clockwise) neighbor, which adds its own input to $v$
and forwards the result along. Once $v$ arrives back at the master
party, the final sum is obtained as $s = v - r$ and broadcast to
the other parties.
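
The ring-based computation can be sketched as a single-process simulation. This is only an illustration under stated assumptions: the modulus `MOD`, the function name `ring_secure_sum` and the use of Python's `secrets` module are ours, not part of the original protocol description.

```python
import secrets

MOD = 2**64  # assumed modulus; arithmetic is done mod MOD so v hides partial sums


def ring_secure_sum(inputs, master=0):
    """Simulate one pass around the ring, starting and ending at the master."""
    n = len(inputs)
    r = secrets.randbelow(MOD)            # master's random blinding value
    v = (inputs[master] + r) % MOD        # v = x_m + r
    for step in range(1, n):              # each neighbor adds its own input
        party = (master + step) % n
        v = (v + inputs[party]) % MOD
    return (v - r) % MOD                  # master removes r: s = v - r


print(ring_secure_sum([3, 5, 9]))
```

Each intermediate party only ever sees a value blinded by `r`, which is why only the master can recover the sum.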

Another approach is based on broadcast communication
[23], [24]. At the beginning, each party generates $r_i$ such
that $\sum_i r_i = 0$. Next, $P_i$ sends $v_i = x_i + r_i$ to the other parties,
and independently computes the sum $s = \sum_i v_i$. Our work
is an extension of this broadcast-based protocol. We discuss
the trade-offs between broadcast and ring-based protocols in
Section II-D, which prompted us to make this specific choice.
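
A minimal sketch of the broadcast-based idea follows. For brevity the zero-sum masks are produced by one helper function, which is a simplification we assume for illustration; in a real deployment no single entity would know all masks. The names `zero_sum_masks`, `broadcast_secure_sum` and the modulus `MOD` are hypothetical.

```python
import secrets

MOD = 2**64  # assumed modulus for masking


def zero_sum_masks(n):
    """Return n masks that sum to 0 modulo MOD (generated centrally for brevity)."""
    masks = [secrets.randbelow(MOD) for _ in range(n - 1)]
    masks.append(-sum(masks) % MOD)       # last mask cancels the others
    return masks


def broadcast_secure_sum(inputs):
    """Each party broadcasts v_i = x_i + r_i; anyone can then add up the v_i."""
    masks = zero_sum_masks(len(inputs))
    broadcast = [(x + r) % MOD for x, r in zip(inputs, masks)]
    return sum(broadcast) % MOD           # masks cancel, leaving sum of x_i


print(broadcast_secure_sum([3, 5, 9]))
```

Because the masks cancel only in the full sum, a single broadcast value reveals nothing about the input behind it.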



A. Model 

Our model consists of a number of parties and delegates.
The party $P_i$ sends its transformed input $\phi(x_i)$ to its delegate
$D_i$, and receives $s$ at the end. The delegates $D_0, \ldots, D_{n-1}$
are connected to each other. They receive inputs from the
parties, start a multi-party computation and finally send back
the results.

1) Adversary model: We assume that delegates are curious
and lazy. They will try to learn inputs from the parties
and/or the resulting sum. To that end, they may listen to the
communication channels, but will not mount more active attacks
that deliberately subvert the computation. For example,
we do not consider attacks in which an adversarial delegate
sends different values in a computation round in order to
partition the result. However, we assume that delegates may
have incentives to be lazy, i.e., to do as little as possible (and
still charge the parties). Specifically, they may skip some (or all)
of the computations, replay results from previous rounds, or
even replace party inputs with other values to save computation
time, which could consequently compromise the integrity of
the sum computation.$^1$

Fig. 2: Packed Paillier scheme: c plaintext values are packed
into a single plaintext of B bits

We assume that every party/data owner is honest-but-curious,
i.e., a data owner will not diverge from the protocol
but may try to learn other parties' inputs. This is a standard
assumption often used in designing multi-party computation 
protocols, where the resulting approaches are simple yet 
secure, even though they do not deal with attack scenarios in 
which adversaries are in majority with respect to the honest 
parties. 

Dishonest delegates may collude with each other, and also 
with the parties. However, the party-delegate collusion is weak, 
in the sense that the party can ask for messages seen by the 
delegate, but the party would not reveal its secret keys. Since 
all the parties are legitimate recipients of the final answer,
if a party colludes to the extent of providing a delegate with the
decryption key or the final outcome, then naturally this (or
any other) protocol cannot safeguard against such a situation.

2) Crypto model: Each party has a public key pair
$(PKp_i, PKs_i)$, and each delegate has a pair $(DKp_i, DKs_i)$. We
assume that parties are running a common, well-known random
number generator (RNG). Our protocols use an additively
homomorphic encryption scheme, whose encryption
$\mathrm{Enc}(eK, m)$ and decryption $\mathrm{Dec}(dK, c)$ operations satisfy:

$$\mathrm{Enc}(eK, m_1) \otimes \mathrm{Enc}(eK, m_2) = \mathrm{Enc}(eK, m_1 + m_2)$$

for an operation $\otimes$. ECC-ElGamal [25] and Paillier [26] are
two such schemes, both providing randomized encryption.
ECC-ElGamal uses additive groups of an elliptic curve,
whereas Paillier relies on composite residuosity classes over a
group. The former is more efficient, but requires homomorphic
transformations of plaintexts to and from an elliptic curve,
which is expensive. Our work uses the latter. It requires a larger
bit-length, but we can employ an optimization that allows us
to perform multiple encryptions at the same time. Specifically,
we pack multiple values into one plaintext so that they can be
encrypted at the same time [16]. Suppose plaintext values are
at most $b$ bits and Paillier's plaintexts are $B$ bits. Suppose
further that any sum value is smaller than $2^{b+k}$; then we
can pack $c$ values into a single plaintext, as demonstrated in
Figure 2, where $c \le \lfloor B/(b+k) \rfloor$. Denote

$$pm(v) = 0 \,\|\, x_1 \,\|\, 0 \,\|\, x_2 \,\|\, \ldots \,\|\, 0 \,\|\, x_c$$

as the packed Paillier message using values $v = (x_1, x_2, \ldots, x_c)$.
We can extract $x_i$ ($1 \le i \le c$) as:

$$x_i = pm(v)[i] = [pm(v) \gg ((c - i)(b + k))] \;\&\; (2^{b+k} - 1).$$

The homomorphic property is maintained, i.e.:

$$cipher = \mathrm{Enc}(eK, pm(v)) \cdot \mathrm{Enc}(eK, pm(v')) = \mathrm{Enc}(eK, pm(v + v'))$$

$^1$This simplified adversary model is justified by legal and economic realities,
where a commercial cloud service provider may try to over-charge customers
or try to (passively) sniff information readily exposed without the liability
of committing a criminal offense, as opposed to launching any deliberate
(proactive) attack to subvert confidentiality.
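
The packing and extraction arithmetic can be illustrated on plain integers, without any encryption. The sketch below assumes a fixed slot width `slot_bits` (value bits plus zero padding) and 1-based slot indices; the function names `pack` and `extract` are hypothetical.

```python
def pack(values, slot_bits):
    """Concatenate values into one integer, one slot of slot_bits bits each."""
    packed = 0
    for x in values:
        if not 0 <= x < (1 << slot_bits):
            raise ValueError("value does not fit in a slot")
        packed = (packed << slot_bits) | x
    return packed


def extract(packed, i, count, slot_bits):
    """Return the i-th slot (1-based) of a message holding `count` slots."""
    return (packed >> ((count - i) * slot_bits)) & ((1 << slot_bits) - 1)


v, w = [5, 7, 9], [1, 2, 3]
total = pack(v, 16) + pack(w, 16)   # adding packed messages adds slot-wise
print([extract(total, i, 3, 16) for i in (1, 2, 3)])
```

Slot-wise addition stays correct only while no slot overflows, which is what the zero padding in each slot is for.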

B. Delegated Secure Sum Protocol 

The protocol comprises four phases: SETUP, ENCRYPT,
COMPUTE and VERIFY. The SETUP phase is run once to
initialize the system and security parameters; the remaining
phases are run in each computation round.

Setup. The goal of this phase is two-fold. First, parties
agree on a secret Paillier key pair $(eK, dK)$ and an initial
value $roundId$. Although Paillier is a public-key scheme,
we treat it as a symmetric scheme, where both the public and
private keys are kept secret. Second, each party generates a
random value $r_i$ (later used for perturbing its inputs) such
that $\sum_i r_i = 0$ (explained shortly).

To generate $(eK, dK)$ and $roundId$, the parties first agree
on a secret $X$, then use it as the seed for the RNG. Having the
same source of randomness, they run the same algorithms for
creating $(eK, dK)$ and $roundId$. One method for establishing
$X$ could be to appoint a master party which generates $X$
and distributes it to the rest. This way, the master produces
and sends $O(N)$ different ciphertexts to the other parties. We
instead adopt a cryptographic approach first proposed for
secret key exchange [27], in which every party contributes
its own randomness to the final value. In our context, each
party sends only 2 messages, while most of the computations
can be outsourced to the delegates. Computations are done in
a globally known group of prime order $G_p(g)$:

1) $P_i$ generates a random value $x_i$ and sends $z_i = g^{x_i}$ to
$D_i$.

2) $D_i$ broadcasts $z_i$, then forwards $a = z_{i+1}/z_{i-1}$ once
it has received the values from the other delegates.

3) $P_i$ computes $X_i = a^{x_i}$ and sends it to $D_i$.

4) $D_i$ broadcasts $X_i$, then forwards $b = z_{i-1}^{n}$ and
$c = X_i^{n-1} \cdot X_{i+1}^{n-2} \cdots X_{i-2}$ to $P_i$.

5) $P_i$ computes $X = b^{x_i} \cdot c = g^{x_1 x_2 + \ldots + x_n x_1}$, which is the
shared secret seed between all parties.
Having generated $(eK, dK)$, $P_i$ constructs $m_i = \mathrm{Enc}(eK, z \,\|\, 0)$ and
sends it to $D_i$, which broadcasts it to the other $D_j$ and consequently
to party $P_j$. On receipt of $m_j$ for all $j \neq i$, $P_i$ decrypts it
and checks if the message was properly formed. If successful,
it means that $X$ and $(eK, dK)$ have been agreed upon by all
parties. Each party assigns the next random number from the RNG
as $roundId$.
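
The seed agreement can be sketched numerically. The steps resemble a ring-style conference key agreement, and this sketch assumes that the value forwarded in step 4 is $b = z_{i-1}^n$, which is what makes step 5 yield $g^{x_1 x_2 + \ldots + x_n x_1}$. It also assumes a toy modulus `P` and base `G` in place of the prime-order group, and it computes the delegates' and parties' parts in one function; all names are hypothetical.

```python
import secrets

P = 2**127 - 1   # toy prime modulus (assumption; stands in for the group G_p(g))
G = 3            # assumed base element


def agree_on_seed(xs):
    """Return the seed X that each party derives from the private exponents xs."""
    n = len(xs)
    z = [pow(G, x, P) for x in xs]                          # step 1: z_i = g^{x_i}
    big_x = []
    for i, x in enumerate(xs):                              # steps 2-3
        a = z[(i + 1) % n] * pow(z[(i - 1) % n], -1, P) % P  # a = z_{i+1} / z_{i-1}
        big_x.append(pow(a, x, P))                          # X_i = a^{x_i}
    seeds = []
    for i, x in enumerate(xs):                              # steps 4-5
        b = pow(z[(i - 1) % n], n, P)                       # assumed b = z_{i-1}^n
        c = 1
        for k in range(n - 1):                              # c = X_i^{n-1} ... X_{i-2}
            c = c * pow(big_x[(i + k) % n], n - 1 - k, P) % P
        seeds.append(pow(b, x, P) * c % P)                  # X = b^{x_i} * c
    return seeds


xs = [secrets.randbelow(P - 2) + 1 for _ in range(4)]
print(len(set(agree_on_seed(xs))) == 1)   # all parties hold the same seed
```

Note that the exponentiations by $n$ and the product $c$ involve only public values, which is why they can be handed to the delegates.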

A common RNG cannot be utilized to create $r_i$ satisfying
$\sum_i r_i = 0$. Instead, we adopt an approach proposed in [24].
Specifically:

1) $P_i$ generates random values $r_{ij}$ for $i \neq j$.

2) $P_i$ sends $m_{ij} = \mathrm{Enc}(PKp_j, r_{ij})$, $\mathrm{sign}_i(m_{ij})$ for all
$j \neq i$ to $D_i$, which then distributes it to $D_j$ and
subsequently to $P_j$.

3) $P_i$ decrypts and verifies the signatures of $m_{ji}$ ($j \neq i$). Then
it computes:

$$r_i = \sum_{j \neq i} (r_{ij} - r_{ji})$$

It can be seen that $\sum_i r_i = \sum_i \sum_{j \neq i} (r_{ij} - r_{ji}) = 0$.
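
A short sketch of the pairwise mask construction, leaving out the encryption and signatures of step 2. It assumes each party combines the shares it generated and the shares it received as $r_i = \sum_{j \neq i} (r_{ij} - r_{ji})$; the function name `pairwise_masks` and the bound on the random shares are hypothetical.

```python
import secrets


def pairwise_masks(n, bound=2**64):
    """Build masks r_i from pairwise random shares r_ij; the masks sum to zero."""
    # r[i][j] is the share that party i generates for party j (unused when i == j)
    r = [[secrets.randbelow(bound) if i != j else 0 for j in range(n)]
         for i in range(n)]
    # party i adds the shares it sent and subtracts the shares it received
    return [sum(r[i][j] - r[j][i] for j in range(n) if j != i)
            for i in range(n)]


masks = pairwise_masks(4)
print(sum(masks))   # every r_ij appears once with each sign, so the total is 0
```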

Encrypt. In this phase, each party encrypts its private
input $x_i$ and sends it to the delegate. More specifically, $P_i$ packs
$roundId$ and $x_i + r_i$ into a single plaintext and encrypts it
with Paillier; the result is $m_i = \mathrm{Enc}(eK, roundId \,\|\, (x_i + r_i))$.
When there are multiple inputs $\{x_{i0}, x_{i1}, \ldots, x_{ik}\}$, $P_i$
packs them into a smaller number of plaintexts $m_{i0}, m_{i1}, \ldots$ with
increasing values of $roundId$.

Compute. On receipt of $m_i$ from party $P_i$, the delegate
$D_i$ broadcasts this value to the other delegates. Once it has all
messages from the other delegates, it computes

$$c = \prod_{0 \le i < n} m_i = \prod_{0 \le i < n} \mathrm{Enc}(eK, roundId \,\|\, (x_i + r_i)) = \mathrm{Enc}\Big(eK, n \cdot roundId \,\Big\|\, \sum_i x_i\Big)$$

and sends it back to the party. Each delegate also signs the
result message: $\sigma_i = \mathrm{sign}_i(i \,\|\, c)$.

Verify. On receiving $c$, $P_i$ asks $D_i$ for its signature
as well as signatures from $t$ other delegates. $P_i$ can decide at
random the value of $t$ and the identities of the verifying delegates.
For instance, it can set $t = 2$ and ask its delegate for the
signatures from $D_i, D_{i-1}, D_{i+1}$: $\sigma_i, \sigma_{i-1}, \sigma_{i+1}$. Once it has
verified that the signatures are correct, $P_i$ decrypts $c$ to get:

$$s = \mathrm{Dec}(dK, c) = n \cdot roundId \,\Big\|\, \sum_i x_i$$

Finally, it extracts $s[1], s[2]$ from $s$ (using the 1-based slot
indexing of the packed message), and checks that $s[1] =
n \cdot roundId$ before assigning $s[2]$ as the final sum value.
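
The ENCRYPT, COMPUTE and VERIFY phases can be exercised end to end with a toy Paillier instance. This is a sketch under several assumptions of ours: tiny fixed primes (insecure), generator $N + 1$, a 32-bit sum slot, masks treated as signed integers reduced modulo $N$, and no signatures. All function names are hypothetical.

```python
import math
import secrets

# Toy Paillier key from two Mersenne primes; far too small for real use.
P_, Q_ = 2**31 - 1, 2**61 - 1
N = P_ * Q_
N2 = N * N
LAM = (P_ - 1) * (Q_ - 1) // math.gcd(P_ - 1, Q_ - 1)   # lcm(p-1, q-1)
MU = pow(LAM, -1, N)
SLOT = 32   # assumed width in bits of the slot holding the sum


def enc(m):
    """Paillier encryption with generator N + 1."""
    r = secrets.randbelow(N - 2) + 1
    return (1 + (m % N) * N) * pow(r, N, N2) % N2


def dec(c):
    """Paillier decryption: L(c^lambda mod N^2) * mu mod N."""
    return (pow(c, LAM, N2) - 1) // N * MU % N


def party_encrypt(round_id, x, mask):
    """ENCRYPT: pack round_id above the masked input and encrypt."""
    return enc((round_id << SLOT) + x + mask)


def delegate_compute(ciphertexts):
    """COMPUTE: multiply ciphertexts, which adds the packed plaintexts."""
    c = 1
    for m in ciphertexts:
        c = c * m % N2
    return c


def party_verify(c, round_id, n):
    """VERIFY: check the round tag, then return the sum slot."""
    s = dec(c)
    if s >> SLOT != n * round_id:
        raise ValueError("round tag mismatch: result rejected")
    return s & ((1 << SLOT) - 1)


masks = [1 << 40, -(1 << 41), 1 << 40]          # example masks summing to zero
cts = [party_encrypt(7, x, r) for x, r in zip([10, 20, 12], masks)]
print(party_verify(delegate_compute(cts), 7, 3))
```

In this example a delegate that returns `pow(cts[0], 3, N2)` instead of the full product is rejected, because the unremoved mask spills into the tag slot.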

C. Security Analysis 

Our protocol has the following security properties. First,
curious delegates and parties cannot see the private inputs of
honest parties, even if they collude with each other. This
is because each input $x_i$ is randomized with $r_i$. Since the
generation of $r_i$ is done in a secure manner, only $P_i$ will be
able to recover $x_i$ from $(x_i + r_i)$.

Second, delegates cannot see the sum value $\sum_i x_i$. $D_i$ has
the encrypted value $c = \mathrm{Enc}(eK, n \cdot roundId \,\|\, \sum_i x_i)$, but
cannot extract the sum since it has no knowledge of the
decryption key. Recall that our model does not consider active
collusion between dishonest parties and delegates in which
secret keys are revealed to the delegates.

Third, the protocol computes the correct sum from the
party inputs, provided that at least one of the $t$ verifying
delegates is honest. This means dishonest delegates cannot
skip computations or replay old values. Nor can they
replace party inputs with other values without getting caught.
The sketch of the proof is as follows. Since the delegate does
not know the encryption key, should it use an input different
from what is given by the party, the VERIFY step will fail
because $s[1] \neq n \cdot roundId$. It cannot replay old values either,
as each round is tagged with a unique value of $roundId$.
The delegate can skip the computation in two ways. First,
it may not use its party's input $m_i$ during the COMPUTE
step. However, this causes the VERIFY step to fail since
$s[1] \neq n \cdot roundId$. Second, it may ignore $m_j$ from other
delegates. In this case, to ensure $s[1] = n \cdot roundId$, the
delegate must construct $(m_i)^n$ (raising $m_i$ to the power of $n$
is cheaper than multiplying $n$ different values). However, the
VERIFY step also checks the results from other delegates,
of which one is honest and whose result must therefore differ
from $(m_i)^n$. Hence, the verification will again detect $D_i$'s laziness.

These security guarantees are achieved while assuming
that the cryptographic schemes are secure. Attacks on these
primitives will inevitably affect our protocols, but they are
outside the scope of this paper.

The VERIFY step is done at the
end of every round, which can be expensive over many rounds.
We extend the protocol to support probabilistic verification, in
which verification is carried out with a probability $p$ at any
given round. The probability of detecting misbehavior can be
made arbitrarily high after a number of verifications. Specifically,
let $ck$ be the number of random checks, and $P(p, n, ck)$ be
the probability that misbehavior is detected after $ck$ checks
(assuming that delegates misbehave consistently$^2$). Then:

P{p,n,ck) = 1 - (1 

D. Broadcast vs Ring-based delegated secure sum 

The traditional ring-based protocol [22] can be extended to
support delegates as follows. A party $P_i$ still encrypts $(x_i + r_i)$
in the same way using the shared Paillier key, and sends
it to $D_i$. The master delegate $D_0$ starts the computation by
forwarding $m_0$ along the pre-defined direction; each delegate
on the path multiplies the incoming message by its party's input
and sends the result along the same direction. Once a message $c$ arrives
at $D_0$, it is broadcast to the other delegates and subsequently
to their parties. Verification at the parties involves performing the
same checks as described above. This simple protocol ensures
that delegates cannot learn private inputs and the sum outputs;
however, it is possible for lazy delegates to skip computations
and render the sum value incorrect. For example, suppose
$D_0$ sends $m_0$ to $D_1$, which then computes $c_1 = m_0 \cdot m_1$ and
forwards it to $D_2$. Suppose $D_2$ and $D_3$ are both dishonest; they
can bypass the computations $c_2 = c_1 \cdot m_2$ and $c_3 = c_2 \cdot m_3$ by
sending $c_3 = c_1 \cdot c_1$ to $D_4$. The final verification checks out,
but the sum is not correct.

To achieve correctness, several enhancements are needed.
The master party first splits the value $v_0 = x_0 + r_0$ into two
parts $v_0^l$ and $v_0^r$ such that $v_0^l + v_0^r = v_0$. $D_0$ then sends $m_0^l$
to its left neighbor and $m_0^r$ to its right neighbor ($m_0^l, m_0^r$ are
the ciphertexts of $v_0^l, v_0^r$). Effectively, there are two streams
of messages in opposite directions. Party $P_i$, on receiving $\sigma^l_{in}$
and $\sigma^r_{in}$ from its two neighbors, computes $\sigma^l_{out} = \sigma^l_{in} \cdot m_i$ and
$\sigma^r_{out} = \sigma^r_{in} \cdot m_i$, and forwards them to the next neighbors.
The two messages arriving at the master delegate are $\sigma^l$ and $\sigma^r$.
During verification, party $P_i$ checks the following conditions:

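1) $\mathrm{Dec}(\sigma^l) = \mathrm{Dec}(\sigma^r)$ and $\mathrm{Dec}(\sigma^l)[1] = n \cdot roundId$.

2) $\mathrm{Dec}(\sigma^l_{in})[1] = i \cdot roundId$ and $\mathrm{Dec}(\sigma^r_{in})[1] = (n - i) \cdot roundId$ and
$\mathrm{Dec}(\sigma^l_{in}) + v_i = \mathrm{Dec}(\sigma^l_{out})$ and
$\mathrm{Dec}(\sigma^r_{in}) + v_i = \mathrm{Dec}(\sigma^r_{out})$.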

$^2$If the delegate misbehaves probabilistically with some probability $p'$, then
the detection probability will be 1 — (1 — p X p')" '^'^ 
3) $\mathrm{Dec}(\sigma^l_{in})[2] + \mathrm{Dec}(\sigma^r_{out})[2] = \mathrm{Dec}(\sigma^r_{in})[2] +
\mathrm{Dec}(\sigma^l_{out})[2] = \mathrm{Dec}(\sigma^l)[2]$

This protocol's only advantage over our broadcast-based
design is that each delegate maintains only two connections, to
its immediate neighbors. Therefore, the communication cost
for each delegate is constant, whereas in our protocol each
delegate must handle $O(n)$ messages. However, there are many
added overheads. First, the master party has to wait until
the message traverses the ring completely before broadcasting
the result. This does not scale with an increasing number of
delegates, as network latency will become significant. Second,
parties need to establish other shared state besides the Paillier
keys. In particular, they have to agree on the identity of the
master party, and on the direction in which to forward the messages.
In case of probabilistic verification, they also must agree on
which rounds to carry out the checks. Having much shared
state makes the system less robust in the presence of failures.
Finally, electing a node as a master party on which
computation correctness rests requires a trust assumption
that may not be realistic in decentralized settings.

III. Data Mining Applications 

A data mining algorithm can be modeled as a process of
extracting knowledge from a database, which consists of two
iterative steps: $x \leftarrow query(D)$, which queries a database $D$, and
$process(x)$, which computes knowledge from the query results.
In a distributed setting, the first step is executed locally, and
the results are aggregated across multiple parties. In other
words, a distributed data mining algorithm now comprises
three steps: $x_i \leftarrow query(D_i)$, $x \leftarrow aggregate(x_i)$
and $process(x)$. Possible aggregate functions are: sum, set
union, scalar product, etc. [28]. Our protocol described in
the previous section provides an implementation of the
sum function. We now take three classical data mining tasks
(classification, association rule mining and clustering) and
show how they can be realized using our service in a
collaborative, cloud-based setting.

A. Secure Database Outsourcing 

Database outsourcing has always been an attractive option 
for small-to-medium businesses even before the era of cloud 
computing. Main reasons for moving data to third-party sites 
are: scalability, high-availability and cost effectiveness, freeing 
up an enterprise's resources for its core business priorities. 
With the advent of cloud computing, this trend has tremen- 
dously accelerated. 

For simplicity, we assume that databases are in relational 
format, and every attribute has a non-negative integer domain 
(other domains could be mapped into the integer domain, the 
details of which are not within the scope of our work). The 
party is the data owner, who wishes to move the data and 
computation to the cloud (or delegate). We assume that the 
party retains a local copy, but most queries and processing 
on the data are to be done on the cloud. This may be the 
case, for instance due to the availability of arbitrary amount 
of computing power and specialized tools (software services) 
on the cloud, which are hard to achieve in-house. 



Data residing at a third party and queries being executed
remotely raise several security issues: data privacy,
integrity, query completeness and query freshness. Since our
adversary model for the cloud is curious and lazy
(Section II-A), we will only deal with the data privacy and
query completeness problems. Techniques for ensuring query
freshness are discussed elsewhere [29], [30].

1) Data privacy: To protect data privacy, encryption can
be used. Many encryption schemes exist, each differing from
the others in its security guarantee and the range of operations that
can be done over the ciphertexts. When outsourcing data, a
trade-off between security and the possibility of computing over
ciphertexts must be made in order for queries to be executable
at a third party [16]. Randomized encryptions (for example
AES in CBC mode with a randomized initialization vector)
offer security against adaptive chosen-plaintext attacks, but no
meaningful computations can be done. Deterministic encryptions
(DETs) such as AES offer less security, but facilitate
equality comparison: $\mathrm{Enc}(x) = \mathrm{Enc}(y) \Leftrightarrow x = y$.
Order-preserving encryptions (OPEs) [31], [32] support inequality
comparisons on ciphertexts: $\mathrm{Enc}(x) < \mathrm{Enc}(y) \Leftrightarrow x < y$.
However, they have a weaker security guarantee than DETs,
since they reveal the plaintexts' order. Another useful encryption
primitive is homomorphic encryption (HOM), such as the
Paillier scheme used in our secure sum protocol. Such schemes
are inherently malleable. Using DETs, simple queries such as
equality selection, COUNT and GROUP BY can be performed
by database engines. With OPEs, MIN, MAX, SORT and ORDER
BY are also supported. Furthermore, these can be done efficiently
since the database engine can build its B+ tree directly
from the ciphertexts. With HOMs, aggregate functions like
SUM or AVG can be performed.

Since there are different types of queries, multiple types
of encryption must be supported at the same time. In
CryptDB [16], data is encrypted in multiple onions used for
different use cases. Each onion has multiple layers: the outermost
layer is the most secure and the innermost supports the most
complex operations. The database queries required for the data
mining algorithms in our work are limited to those supported
by DETs and OPEs. We adopt a simple approach that stores
two copies of the database at the cloud: one encrypted with
AES and one with OPE. Column names and table names are
also encrypted with AES. A database query is transformed into
its encrypted version by encrypting column names, table names
and attribute values with the appropriate encryption key and
scheme. For example, two queries:

    select COUNT(*) from t
    where t.a1 = x AND t.a2 = y;

    select COUNT(*) from t
    where t.a1 < x AND t.a2 < y
    ORDER BY t.a1;

are translated into:

    select COUNT(*) from t_aes
    where t_aes.AES(a1) = AES(x)
    AND t_aes.AES(a2) = AES(y);

    select COUNT(*) from t_ope
    where t_ope.AES(a1) < OPE(x)
    AND t_ope.AES(a2) < OPE(y)
    ORDER BY t_ope.AES(a1);

where t_aes and t_ope are the encrypted names of the AES-encrypted
and OPE-encrypted tables derived from the original
table t. We assume that both database encryption and query
translation are done by the data owner (or by authorized users).

2) Query completeness: Executing a query over a large
database is an expensive operation. As the cloud is lazy, it
has incentives to skip some parts of the data when performing
the query, or even ignore the query altogether and return a
random response instead. Li et al. proposed a data structure
called Authenticated Aggregation R tree (AAR tree [33]) that
can produce a proof that the query has been executed over
the complete data. However, the most complex queries that
the AAR tree can support are COUNT or SUM over range selection
conditions, and the proofs are expensive to construct and
verify.

We propose a probabilistic solution that takes advantage
of the data owner having a local copy of the database. A
straightforward protocol would require the data owner to
probabilistically execute a query $q$ over its local copy of the
data and compare the result with what is returned from the cloud.
Thus, if $p$ is the probability that the cloud is lazy for any
given query, then the probability of it getting caught after $k$
checks will be $1 - (1 - p)^k$, whose value rapidly approaches 1.
Notice that even though $k$ may be small, executing $q$ directly
on the local database may not be desirable, as the party wishes
to be involved as little as possible.
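
As a worked illustration of the detection probability $1 - (1 - p)^k$, the short sketch below evaluates it and inverts it to find how many checks reach a target confidence. The function names and the example values of $p$ are our own choices.

```python
import math


def detection_probability(p, k):
    """Probability that a cloud that is lazy with probability p is caught in k checks."""
    return 1 - (1 - p) ** k


def checks_needed(p, target):
    """Smallest k for which the detection probability reaches `target`."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))


print(detection_probability(0.5, 2))   # 0.75
print(checks_needed(0.1, 0.99))        # 44
```

Even a cloud that cheats on only one query in ten is therefore caught with 99% probability after 44 checks.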

We observe that even if the cloud ignores some parts of the
data, the data mining output may still be accurate. We conduct
experiments with the NaiveBayes and K-Means algorithms to test
the accuracy when blocks of data are removed at random.
We divide the data into $b$ blocks, and use misclassification
rates and root mean square errors as accuracy metrics. The
result depicted in Figure 3b suggests that accuracy indeed
remains high. The output of NaiveBayes, for example, is above
99.6% correct even when 30% of the blocks are removed.
Another observation is that when data is divided into blocks
and the cloud removes a considerable number of blocks, the
party may execute the query only over a small number of
blocks in order to detect inconsistency. Our protocol is based
on [34], and proceeds as follows. The query $q$ is transformed
into a list of smaller queries $Q = (q_1, q_2, \ldots, q_k)$, each executed
on one block. The party sends $Q$ to the cloud. Let $r$, $w$ be the
numbers of queries in $Q$ performed by the party locally and by
the cloud at its site, respectively. The probability $P(b, r, w)$ that
the cloud stays undetected when performing only $w$ out of $b$ queries is:

min{r,w) 

P(b,r,w)^^ y ,^ , r- 

(") ^ — ' rmnio — w,max{l,b — i)) 

\rJ i=max(0,w+r-b) ^ ' ^ ' " 

Figure 3a shows that the probability of successful cheating
decreases exponentially when the cloud ignores more data. It
indicates that the party only needs to execute the query over
a small portion of its local data (10-15%) for the detection
to be effective. This suggests that the party may not need to
store the entire database locally: it may suffice to have 20%
of the data (unknown to the cloud delegate) and refresh them
at pre-defined intervals. We leave further investigation of this
question for future work.

In summary, our protocol is effective for ensuring query 
completeness: if a large amount of data is ignored, the cloud 
gets caught with very high probability; if a small part is 
ignored, the data mining algorithms are largely unaffected. 

B. Data Mining Algorithms 



Algorithm 1: Naive Bayes classification 

Input: Y, A, V, local party p 
Output: N, N_y, N_{y,a,v} for all y, a, v 
1 foreach y ∈ Y do 
2     N_y^p ← QueryCount(label = y) 
3     foreach attribute a ∈ A and v ∈ V_a do 
4         N_{y,a,v}^p ← QueryCount(x_a = v, label = y) 
5 foreach y ∈ Y do 
6     N_y ← securesum(N_y^p) 
7     foreach a ∈ A and v ∈ V_a do 
8         N_{y,a,v} ← securesum(N_{y,a,v}^p) 



1) Classification (Naive Bayes): A classification algorithm 
takes as input a set of labeled (training) data and outputs a 
classifier that can be used to label new (test) data. Let $Y$ 
be a set of labels, and let $y \in Y$ be the label of an instance $x$ 
(a multivariate vector). Let $A = \{a_1, a_2, \ldots\}$ be a set of attributes (columns) 
and $V = \{V_i\}$ be the set of attribute domains. Let $N$ be the 
number of instances (rows), $N_y$ be the number of instances 
with label $y$, and $N_{y,a,v}$ be the number of instances whose 
column $a$ has value $v$ and whose label is $y$. The NaiveBayes 
algorithm (Algorithm 1) returns $(N, N_y, N_{y,a,v})$ for all $y, a, v$ 
as the classifier [35]. The label of a new instance $x$ is: 

$$\arg\max_{y \in Y} \left( \frac{N_y}{N} \prod_{a \in A} \frac{N_{y,a,x_a}}{N_y} \right)$$ 
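
As an illustration of how a count-based classifier of this kind is used, the Python sketch below builds per-party counts, adds them together, and labels a new instance by the standard Naive Bayes rule (prior times the product of per-attribute conditional frequencies). It is a minimal sketch under stated assumptions: plain addition stands in for the secure sum service, the data layout (dictionaries of attribute values) is hypothetical, and no smoothing is applied.

```python
from collections import Counter


def local_counts(rows):
    """rows: list of (attribute_dict, label). Returns (N, N_y, N_{y,a,v})
    computed over one party's data, as the COUNT queries would."""
    n_y, n_yav = Counter(), Counter()
    for attrs, y in rows:
        n_y[y] += 1
        for a, v in attrs.items():
            n_yav[(y, a, v)] += 1
    return len(rows), n_y, n_yav


def merge(models):
    """Adds the parties' counts. Plain addition is an assumption here:
    it replaces the secure sum service for illustration only."""
    n = sum(m[0] for m in models)
    n_y, n_yav = Counter(), Counter()
    for _, ny, nyav in models:
        n_y.update(ny)
        n_yav.update(nyav)
    return n, n_y, n_yav


def predict(model, x):
    """Label maximizing (N_y / N) * prod_a (N_{y,a,x_a} / N_y)."""
    n, n_y, n_yav = model

    def score(y):
        s = n_y[y] / n
        for a, v in x.items():
            s *= n_yav[(y, a, v)] / n_y[y]
        return s

    return max(n_y, key=score)


if __name__ == "__main__":
    party1 = [({"color": "red", "size": "big"}, "yes"),
              ({"color": "red", "size": "small"}, "yes"),
              ({"color": "blue", "size": "small"}, "no")]
    party2 = [({"color": "blue", "size": "big"}, "no"),
              ({"color": "red", "size": "big"}, "yes")]
    model = merge([local_counts(party1), local_counts(party2)])
    print(predict(model, {"color": "red", "size": "big"}))
```

Because the classifier consists only of counts, each party's contribution can be aggregated by summation, which is what makes a secure sum primitive sufficient for this task.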

QueryCount($x_1 = v_1$, ..) follows the protocol for querying 
outsourced databases described in the previous section. 
Basically, the party constructs a COUNT query of the form 

    select COUNT(*) from Data 
    where (x_1 = v_1) AND .. 

then encrypts the names Data, $v_1, v_2, \ldots$ before sending it to 
the cloud. The cloud executes the query and returns a result, which 
is verified (if verification is applied). securesum($x$) invokes the secure 
sum service using $x$ as the party's input. If verification of the query 
or of the secure sum process fails, the algorithm terminates. 
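
The additive property that a Paillier-based secure sum relies on can be illustrated with a textbook Paillier implementation. The sketch below is not the delegate protocol of this paper: it omits delegates, signatures and verification, and it uses tiny hard-coded primes (an assumption made purely so the example runs instantly; real deployments use keys of 512 bits or more, as in the evaluation). It only shows that multiplying ciphertexts yields an encryption of the sum of the plaintexts.

```python
import math
import random


def keygen(p: int = 1789, q: int = 1861):
    """Textbook Paillier key generation with generator g = n + 1.
    The small primes are illustrative only and offer no security."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse, Python 3.8+
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """c = (n+1)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def add(n: int, c1: int, c2: int) -> int:
    """Homomorphic addition: the product of ciphertexts encrypts m1 + m2."""
    return (c1 * c2) % (n * n)


def decrypt(n: int, priv, c: int) -> int:
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


if __name__ == "__main__":
    n, priv = keygen()
    total = encrypt(n, 0)
    for value in (12, 30, 58):  # one input per party
        total = add(n, total, encrypt(n, value))
    print(decrypt(n, priv, total))
```

An aggregator that only multiplies ciphertexts never sees the individual inputs; the sum is correct as long as it stays below the modulus $n$.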

2) Clustering (K-Means): Clustering algorithms partition 
data into separate clusters such that the distance between elements 
belonging to the same cluster is smaller than that between elements in 
different clusters. The K-Means algorithm (Algorithm 2) finds 
$k$ clusters identified by their centroids (the mean centers of the 
clusters). Each party starts with a set of chosen centroids, then 
computes new centroids by grouping data into clusters using 
the previous centroids. The algorithm works in multiple rounds 
until convergence (i.e., the set of centroids is unchanged 
compared to the previous round). This algorithm is slightly 
different from the standard K-Means [36], for we are only 



(a) Success probability (b) Data mining accuracy 

Fig. 3: Query completeness with probabilistic verification 



Algorithm 2: K-Means Clustering 

Input: Number of clusters k, A, local party p 
Output: Set of centroids M = {m_1, .., m_k} 
1 foreach m_i ∈ M do 
2     m_i = (i, i, ..., i) 
3 C^p = ∅; foreach m_i ∈ M do 
4     C^p_{m_i} = ∅ 
5     foreach a ∈ A do 
6         C^p_{m_i}(a) ← QueryGroupBy(a, m_i, M) 
7         C^p_{m_i} = C^p_{m_i} ∪ C^p_{m_i}(a) 
8     C^p = C^p ∪ C^p_{m_i} 
9 foreach C^p[i] ∈ C^p do 
10     C[i] ← securesum(C^p[i]) 

11 c ^ c 

12 foreach m_i ∈ M do 
13     Extract C_{m_i} from C 
14     foreach a ∈ A do 
15         m_i(a) ← Mode(C_{m_i}(a)) 
16 Repeat Step 3 until M converges. 



dealing with integer-domain attributes (or categorical data). 
Specifically, the mode (the most frequently seen value, 
computed by the Mode function) is used instead of the mean. 

Many different metrics exist for quantifying the distance from 
an element $x$ to a centroid $c$. We use the Manhattan distance 
for its simplicity. In particular: $\Delta(x, c) = \sum_i |x_i - c_i|$. 

QueryGroupBy($a$, $m_i$, $M$) asks the cloud to return a list of 
frequencies for attribute $a$ in the portion of data closest to the 
centroid $m_i$. The database query has the following form: 

    select a, COUNT(*) as freq from Data 
    where Δ(a, m_i) < Δ(a, m_1) AND ... 
    AND Δ(a, m_i) < Δ(a, m_{i-1}) 
    AND Δ(a, m_i) < Δ(a, m_{i+1}) AND ... 
    group by a order by a 
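
The following Python sketch mimics, on plaintext data held in memory, what a grouped frequency query of this kind computes, and how the Mode function then updates one coordinate of a centroid. It is an illustration under assumptions: rows are tuples of integers, the distance is taken between a whole row and each centroid, and the function names are hypothetical rather than taken from the prototype.

```python
from collections import Counter


def manhattan(x, c):
    """Manhattan distance: sum_i |x_i - c_i|."""
    return sum(abs(xi - ci) for xi, ci in zip(x, c))


def query_group_by(rows, a, i, centroids):
    """Frequencies of attribute index a among the rows that are strictly
    closer to centroids[i] than to every other centroid, ordered by value
    (mirrors the strict '<' comparisons and the GROUP BY / ORDER BY)."""
    freq = Counter()
    for x in rows:
        d_i = manhattan(x, centroids[i])
        if all(d_i < manhattan(x, m)
               for j, m in enumerate(centroids) if j != i):
            freq[x[a]] += 1
    return sorted(freq.items())


def mode(freq):
    """Most frequent value in a list of (value, count) pairs."""
    return max(freq, key=lambda kv: kv[1])[0]


if __name__ == "__main__":
    rows = [(1, 1), (1, 2), (2, 1), (8, 9), (9, 9)]
    centroids = [(1, 1), (9, 9)]
    freq = query_group_by(rows, 0, 0, centroids)
    print(freq, "-> new coordinate:", mode(freq))
```

Since each party's result is a list of counts per attribute value, the lists from different parties can be added element-wise before the mode is taken.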

As $\Delta$ is computed over OPE ciphertexts, what is returned 
from the cloud for QueryGroupBy might not be correct. 
OPE's only guarantee is $\mathrm{Enc}(x) < \mathrm{Enc}(y) \Leftrightarrow x < y$, 
hence it does not always follow that $|\mathrm{Enc}(x) - \mathrm{Enc}(x')| < 
|\mathrm{Enc}(y) - \mathrm{Enc}(y')| \Leftrightarrow |x - x'| < |y - y'|$. Suppose that in 
the unencrypted database an element $x$ is closer to centroid 
$c_1$ than to $c_2$; the distance based on the OPE values of $x$, $c_1$ and $c_2$ 
might nonetheless indicate that $x$ is closer to $c_2$. Our experimental study 
(discussed later) shows that this phenomenon occurs frequently, 
but the final clusters are very close to the clusters found using 
the unencrypted databases. 
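
A small numeric example shows how an order-preserving mapping can reverse a distance comparison. The Python sketch below uses a toy strictly increasing function as a stand-in for OPE; this is an assumption for illustration only, since a real OPE scheme is keyed and randomized, but any strictly increasing, non-linear mapping exhibits the same effect.

```python
def toy_ope(x: int) -> int:
    """Strictly increasing toy mapping standing in for an OPE ciphertext.
    It preserves order but not differences."""
    return x ** 3 + 7


def closer_to_first(x, c1, c2, enc=lambda v: v):
    """True if x is strictly closer to c1 than to c2 under the mapping enc."""
    return abs(enc(x) - enc(c1)) < abs(enc(x) - enc(c2))


if __name__ == "__main__":
    x, c1, c2 = 5, 2, 7
    print("plaintext: closer to c1?", closer_to_first(x, c1, c2))
    print("encrypted: closer to c1?", closer_to_first(x, c1, c2, enc=toy_ope))
```

On plaintext, 5 is closer to 7 than to 2, whereas on the mapped values it appears closer to 2, so a nearest-centroid query over such values can assign the element to a different cluster.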



Algorithm 3: Apriori association rule mining 

Input: Support threshold minsup and confidence threshold minconf 
Output: Set of association rules X → Y 
1 L_1 ← GenerateFrequentItemSetSize1() 
2 k = 2 
3 C_k ← GenerateCandidates(L_{k-1}) 
4 B^p = ∅ 
5 foreach c ∈ C_k do 
6     t ← QueryCount(c) 
7     B^p = B^p ∪ t 
8 foreach 1 ≤ i ≤ |C_k| do 
9     B[i] ← securesum(B^p[i]) 
10     Extract c.count from B[i] 
11 L_k ← {c ∈ C_k | c.count ≥ minsup} 
12 Increase k and repeat from line 3 until L_k = ∅ 
13 GenerateRules(∪_k L_k, minconf) 




3) Association rule mining (Apriori): Association rule 
mining algorithms extract relationships between attributes 
that occur frequently in the data. An association rule is 
of the form $X \rightarrow Y$ where $X, Y \subseteq A$. The Apriori algorithm 
(Algorithm 3) first determines frequent item sets containing a 
single item. GenerateFrequentItemSetSize1 issues 
COUNT queries with different attributes. The results are 
merged into a set of candidates (larger item sets, using 
GenerateCandidates). The threshold value minsup 
specifies the lower bound for item set frequency. These 
steps are repeated until there are no more item sets to 
be found. Finally, GenerateRules generates outputs 
by establishing rules between non-empty subsets and 
removing rules whose confidence values are below minconf. 
The details of GenerateFrequentItemSetSize1, 
GenerateCandidates and GenerateRules can be 
found in ED- 
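
The level-wise structure of Apriori over counts aggregated from several parties can be sketched as follows in Python. This is a simplified illustration under assumptions: plain addition of per-party counts stands in for the secure sum service, candidate generation is a naive join without the usual subset pruning, rule generation is omitted, and all names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_counts(transactions, candidates):
    """Per-party support counts, as the COUNT queries would return them."""
    return Counter({c: sum(1 for t in transactions if c <= t)
                    for c in candidates})


def frequent_itemsets(parties, minsup):
    """Level-wise search for item sets whose total support across all
    parties is at least minsup."""
    items = sorted({i for db in parties for t in db for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, k = {}, 1
    while candidates:
        total = Counter()
        for db in parties:
            # Plain addition here is a stand-in for the secure sum service.
            total.update(local_counts(db, candidates))
        level = {c: n for c, n in total.items() if n >= minsup}
        frequent.update(level)
        k += 1
        # Naive join: unions of two frequent sets that have exactly k items.
        candidates = list({a | b for a, b in combinations(level, 2)
                           if len(a | b) == k})
    return frequent


if __name__ == "__main__":
    party1 = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
    party2 = [{"a", "b"}, {"b", "c"}]
    for itemset, count in frequent_itemsets([party1, party2], 3).items():
        print(sorted(itemset), count)
```

Only the per-candidate counts cross party boundaries, so each round of the loop maps onto one batch of COUNT queries followed by one round of summation.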
Operation                   Time (ms) 
encryption (512-bit)        0.64 (±0.15) 
decryption (512-bit)        0.36 (±0.11) 
encryption (1024-bit)       4.53 (±0.18) 
decryption (1024-bit)       2.49 (±0.17) 
signing (1024-bit RSA)      1.01 (±0.005) 
signature verification      0.38 (±0.05) 

Fig. 6: Cost of cryptographic operations (ms) 



IV. Evaluation 

A. Prototypes 

We have implemented a prototype for the protocols discussed 
in the previous section.¹ The secure sum service is 
implemented in Java, in which AES encryption and RSA 
signatures are provided by the Crypto++ library [38], and OPE and 
Paillier encryption by the CryptDB library [16]. Communications 
between parties and delegates are done via Java sockets, using 
the thread-per-connection model. The data mining algorithms 
are implemented in Java, using the Weka library [39]. 

We now discuss our experiments for evaluating the secure 
sum service, first as a stand-alone service, then as a part 
of complex data mining applications. All experiments are run 
on a cluster of 16 nodes, each with a 3.0 GHz Xeon processor 
and 4 GB of RAM, running the OCS5.1 (2.6.18-53E15smp) operating 
system. The machines are connected via 20 Gbps InfiniBand. 

B. Secure Sum Benchmark 

We first examine the secure sum service as a stand-alone 
application. The salient metric here is its throughput: number 
of sum operations completed per second, especially in com- 
parison with the traditional, non-delegate secure sum protocol. 

We set up a cloud-like environment with up to 6 parties 
and 6 delegates. We emulate the condition in which the network 
connections of individual parties have lower capacity and speed 
than those between cloud providers, by adding artificial 
latency to messages sent from any party. We experimented 
with the difference in latency between ingress/egress and 
intra-cluster traffic; the results indicate an extra delay of 1 
ms. In the experiments, each party starts an infinite loop that 
invokes the secure sum service with a single value. Throughput 
is measured as the number of sum values returned 
successfully to the party per second. Figure 4(a) shows the recorded 
throughput at steady state, using 512-bit Paillier encryption. 
Our protocol achieves over 320 sums/sec. The effect of 
increasing the number of parties/delegates on throughput is 
small, although there is a slight reduction when the number 
of delegates increases. This is because, with more delegates, 
each has to wait for more messages before computing 
the final sums. 

Figure 4(b) compares the service throughput with that of 
the non-delegated protocol, using 4 parties. The throughput of our 
service is 83% of the non-delegated throughput for 512-bit Paillier, 
and drops to 35% for 1024-bit. The cryptographic operations (Paillier 
encryption, decryption and signature verification at the party, and 
signing at the delegates) account for the observed overhead. 

¹ Source code is available on request and is being sanitized for release. 
Fig. 8: Preparation (or one-off) cost of encrypting and loading 
the database 



Their computational costs (quantified in terms of completion 
times) are detailed in Figure 6. In the non-delegated version, the 
main factors restricting throughput are network speed and 
CPU, which explains the high throughput as well as the large 
fluctuation. The close gap between our 512-bit service and the 
non-delegated version can be explained as follows. Using 512-bit 
Paillier, encryption and decryption at a party take roughly 
1 ms in total, which is close to the overhead incurred in handling 
$O(n)$ messages to/from the other parties instead of 1 message 
to/from the delegate. For 1024-bit, however, the overall cost 
is dominated by encryption/decryption operations, which 
explains the low and consistent throughput. Figure 5 illustrates 
the number of messages sent and received by each party during 
the secure sum protocol. With respect to network cost at a 
party, increasing the number of parties does not affect our 
protocol. 

C. Data Mining Performance 

Having explored the performance and cost associated with 
the secure sum service, we next evaluate the distributed data 
mining applications that use the service as a building block. 

We use both real and synthetic datasets for running the data 
mining algorithms (Figure 7(a)). Three datasets, breast_cancer 
(small), mushroom (large, many rows) and splice (large, many 
columns), are from the UCI Machine Learning Repository [40], 
from which we synthesize larger datasets by extending them 
with random values from similar distributions. Other system 
parameters are summarized in Figure 7(b). The results 
presented below, unless otherwise stated, are for 4 parties and 
1024-bit Paillier encryption. 

In our implementation, the party encrypts its data with AES 
and OPE and uploads it to the delegate, which loads it into 
a MySQL server. The party needs to break its original data 
into $b$ blocks and keep part of it locally. We select $b = 10$ 
and load 20% of the data into the database server at the party. 
With $b = 10$ and $r = 20\%$, the probability of cheating 
successfully is only 0.77 when the delegate skips 1 block, and 
drops to 0.04 when it skips 5 blocks. This is for one check, 
and the probability gets arbitrarily small over time with multiple 
checks. Each database uploaded to the delegates is divided 
Fig. 4: Comparing sums/sec of delegated (our protocol) and non-delegated secure sum 
Fig. 7: System variables 



into 10 smaller tables. When querying, the party generates 
10 queries from its original query, and assembles the partial 
results when they come back. With a pre-defined probability, 
the party executes queries on its local database (consisting 
of 2 small tables) and compares the outputs with what is 
returned from the delegate. Figure 8 shows the initial, one-off 
cost of encrypting the data at the party and of loading 
it at its delegate. The cost is proportional to the data size, 
with a maximum of 30 seconds for the mushroom_x50 dataset. 
Compared to the original, the encrypted data uploaded to the 
delegate is larger (22.8 (±0.14) times), but the 
loading times remain below 45 seconds. 

The overall completion time for every data mining algorithm 
can be broken down into two components: secure sum time and 
database query time. Figure 9 depicts this metric for different 
algorithms and datasets. A common pattern is that database 
queries are at least an order of magnitude more expensive 
than secure sum. The longest experiment takes 33 minutes 
(for running Apriori on splice_x10), of which secure sum itself 
takes less than 2 minutes. This observation suggests that when 
used in real data mining algorithms, the overhead incurred by 
using the secure sum service has little effect on the overall 
performance. 

1) Query time: Figure 10 illustrates the impact of increasing 
data size on query time. It can be observed 
that query time does not always scale with the size of the 
data, especially for Apriori and KMeans. In particular, the 
mushroom_x50 dataset is almost 50 times larger than mushroom, but 
in Apriori the query time increases by more than two orders 
of magnitude. Similarly, splice_x10 is 10 times bigger, yet the 
query times for KMeans are roughly the same. These are due 
to intrinsic properties of the data mining algorithms, which we 
attempt to explain in the following. 

Fig. 12: Local database query time (for verification) vs 
database query at the cloud 

We extract the number of queries and the time per query, the 
results of which are shown in Figure 11. As many queries from 
the party in Apriori are duplicates and are subsequently cached at 
the party (80% for the splice datasets), the figure shows only the 
number of queries that actually get executed at the delegate. 
For Apriori on the mushroom datasets, bigger data size causes 
longer execution time per query (from 16 to 231 ms) and a sharp increase 
in the number of queries (132 to 3214). Therefore the query 
time for the mushroom_x50 dataset is much longer. 

Fig. 9: Running times for different data mining algorithms. 
Fig. 10: Database query time at different datasets 
Fig. 11: Database query benchmark 
For KMeans on the splice datasets, however, time per query 
does not increase much, whereas the number of queries 
actually decreases. This explains why the query time for splice_x10 is 
roughly the same as for splice. Not only is the number of queries 
smaller when the data size increases, it is also not the same for every 
run of KMeans on the same dataset (standard errors range from 
5% to 10%). We will revisit this behavior later. 

There are three types of database queries in our experiments: 
simple COUNT with at most 2 selection conditions (NaiveBayes), 
COUNT with multiple selection conditions (Apriori), and 
GROUP BY and ORDER BY (KMeans). Figure 11(a) shows 
the difference in execution times for these queries. The 
seemingly more complex queries (GROUP BY and ORDER BY) 
take less time. This is because these queries are executed 
over OPE databases, which are both smaller and more 
efficiently managed by the database engine (since they are 
order-preserving, the engine can build a B+ tree directly from them). 



Figure 12 compares query execution time at the party versus 
at the delegate. It is clear that the latter takes much longer. 
There are three reasons: the encrypted database is larger, 
the delegate executes every query over the entire database 
whereas the party does so over only 20% of it, and the verification 
process is probabilistic. The average difference is more 
than two orders of magnitude. This is further evidence of the 
benefit of outsourcing databases to the cloud, with which the 
party only needs to perform a small amount of work locally. 

2) Secure sum time: We extract and analyse the time 
taken by the secure sum service. In contrast to the previous 
benchmark (Section IV-B), only a small number of secure 
sum operations are used during the execution of the data 
mining tasks. Furthermore, one round of secure sum may 
involve multiple values (as opposed to one value per round 
in Section IV-B), hence many of them can be packed 
into a single Paillier encryption, as elaborated previously in 
Section II-A and Figure 2. 
Fig. 13: Secure sum time with varying number of 
parties/delegates, using breast_cancer_x50 dataset 
Fig. 14: Encryption times for Apriori algorithm, with varying 
number of sum values (with different datasets) 



Figure 13 shows the effect of the number of parties on 
secure sum. It can be observed, as in Figure 4, that 
increasing $N$ has almost no impact on the secure sum time. 
However, the variance is visible, especially for NaiveBayes, for two reasons. 
First, the NaiveBayes experiments on this particular dataset involve 
a single round of secure sum, hence the inherent variance 
is not amortized over multiple rounds. Second, when 
running as part of a data mining task, database operations 
inevitably interfere with the secure sum protocol and cause 
more variance. For example, party $P_i$ may have finished its 
database queries and started sending values to its delegate for 
secure summing, but delegate $D_j$ ($j \neq i$) may still be busy 
with queries from its party; $D_i$ will therefore have to wait until 
$D_j$ finishes and receives a value from $P_j$. 



Figure 14 demonstrates the benefit of packing multiple 
values into a single Paillier encryption. The x-axis is the 
number of values the party sends to the delegate to be summed, 
which is smaller than the actual number of encryptions. It can 
be seen that encryption time rises with the number of sum 
values and the bit length. In addition, using 1024-bit keys with 
packing is faster than using 512-bit keys without packing, because the 
benefit of parallel encryptions is 15-fold, whereas the 
speed-up gained when encrypting with 512 bits is less than 10-fold. 
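
The idea behind packing can be illustrated without any cryptography: several small non-negative integers are placed in disjoint bit slots of one large integer, so that adding two packed integers adds the values slot by slot. The Python sketch below is a generic illustration under assumptions; the slot width and the function names are hypothetical, and the exact layout used by the service (described in Section II-A) is not reproduced here.

```python
def pack(values, slot_bits):
    """Places each value in its own slot of slot_bits bits.
    Each value (and each later slot-wise sum) must stay below 2**slot_bits."""
    packed = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << slot_bits):
            raise ValueError("value does not fit in a slot")
        packed |= v << (i * slot_bits)
    return packed


def unpack(packed, count, slot_bits):
    """Recovers count slot values from a packed integer."""
    mask = (1 << slot_bits) - 1
    return [(packed >> (i * slot_bits)) & mask for i in range(count)]


if __name__ == "__main__":
    party_a = [3, 10, 250]
    party_b = [4, 20, 5]
    # Adding the packed integers adds the values slot by slot, which is
    # what an additively homomorphic scheme does on the plaintexts.
    total = pack(party_a, 16) + pack(party_b, 16)
    print(unpack(total, 3, 16))
```

With a plaintext space of about 1024 bits and, for instance, 32-bit slots, roughly 32 values would share one encryption, which is why a longer key with packing can outperform a shorter key without it.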

3) Correctness of KMeans: As noted earlier in Figure 11, 
the number of database queries for a KMeans experiment 
varies between different runs on the same dataset. This 
behavior is caused by the potential error in the GROUP BY 
and ORDER BY queries over OPE databases (explained in 
Section III-A). In brief, the query, even if executed truthfully 
by the delegate, may yield a different result from that of 
performing the query locally on the plaintext data. We refer to 
this as a mismatched query. The errors affect the convergence of the 
algorithm as well as the final centroids. All of our experiments 
with KMeans, however, converge to final centroids. 

Fig. 15: KMeans correctness 

To quantify the differences between the centroids found by our 
protocols and those found by KMeans on plaintext 
data (called standard KMeans), we compute the cluster error for 
each pair of centroids: 

where $C_i$ and $C'_i$ are the clusters found by our protocol and by 
the standard KMeans, respectively, and the error of a cluster $C_i$ is 
measured by its mean squared error (the mean squared distance of each 
member of the cluster to its centroid). Figure 15 shows this metric for 
all datasets, together with the number of mismatched queries. 
The errors seem independent of how many mismatched queries 
there are. In all cases, our protocols yield clusters whose 
quality is very close to that obtained from the standard 
KMeans. 

D. Discussion. 

So far, we have quantified the cost for doing collaborative 
data mining on the cloud in a secure manner. There are 
overheads when using our secure sum service which alone may 
not be a favorable argument for moving one's IT infrastructure 
to the cloud. However, when used in the context of data 
mining, benefits of the cloud could outweigh the costs of 
maintaining one's own infrastructure. 

Let $m$ be the number of unique sum messages sent and 
received by the party for secure summing during a data 
mining algorithm. Let $\alpha$ be the crypto cost for encrypting and 
decrypting a message (with additive homomorphic encryption 
schemes). Let $q$ be the number of database queries and 
$C_q$ the CPU cost for each query. The computation cost of 
performing the collaborative data mining algorithm on an 
in-house infrastructure can be estimated as: 

$$C = q \cdot C_q$$ 

while the cost using our cloud-based approach is: 

$$C_d = \alpha \cdot m$$ 

Thus, the overhead at each party becomes $O = (C_d - C) = 
(\alpha \cdot m - q \cdot C_q)$, which diminishes quickly and becomes 
negative for complex data mining algorithms (larger $q$) or larger 
datasets (larger $C_q$). It has been shown in Figure 9, for example, 
that the query costs are orders of magnitude higher than the 
cryptographic costs incurred by the secure sum service. 
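
The cost model can be evaluated numerically with a few lines of Python. In the sketch below, $\alpha$ is taken as the sum of the 1024-bit encryption and decryption times from Figure 6, and $q$ and $C_q$ use the Apriori figures reported for the mushroom dataset (132 queries at about 16 ms each); the number of sum messages $m$ is a hypothetical value chosen only for illustration.

```python
def in_house_cost(q: int, c_q: float) -> float:
    """C = q * C_q: query cost when the party runs everything itself."""
    return q * c_q


def cloud_cost(m: int, alpha: float) -> float:
    """C_d = alpha * m: crypto cost at the party when using the service."""
    return alpha * m


def overhead(m: int, alpha: float, q: int, c_q: float) -> float:
    """O = C_d - C; a negative value means the cloud-based approach is cheaper."""
    return cloud_cost(m, alpha) - in_house_cost(q, c_q)


if __name__ == "__main__":
    alpha_ms = 4.53 + 2.49  # 1024-bit encryption + decryption (Figure 6)
    m = 50                  # hypothetical number of sum messages
    q, c_q_ms = 132, 16.0   # Apriori on mushroom: queries and ms per query
    print("C_d =", cloud_cost(m, alpha_ms), "ms")
    print("C   =", in_house_cost(q, c_q_ms), "ms")
    print("O   =", overhead(m, alpha_ms, q, c_q_ms), "ms")
```

Under these numbers the overhead is already negative, and it decreases further as $q$ or $C_q$ grows.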

Since each party communicates only with a delegate, security 
policy enforcement becomes easier and network costs are 
reduced. In particular, the number of extra 
messages handled by each party in our protocols compared 
to the non-cloud version is $O(q - n \cdot m)$, which for a given 
data mining task decreases with more parties. Even with 
complex tasks where $q$ is large, the marginal network cost will 
be more than offset by the computation savings. 

Furthermore, we note that a party needs to maintain (a small 
amount of) local data only if it wishes to verify the task results, 
that is, only for resilience against lazy delegates. If 
delegates' laziness is not a concern and only data privacy is, 
then no data needs to be maintained locally at all. This means 
the party could benefit greatly from storage savings. 

In summary, as the workload (and the number of parties) 
increases, it is more beneficial to migrate data to the cloud 
and use our secure sum service for data mining tasks than to 
maintain one's own infrastructure. There are other qualitative 
advantages of using the cloud, such as access to diverse tools 
and services provided on demand, which may be 
difficult or expensive to acquire, deploy and maintain in-house. 

V. Related Work 

Our work is based on several areas of research: 
outsourced databases, secure multi-party computation and 
verifiable computation. For outsourced databases, existing works 
concern authenticated data structures for guaranteeing query 
freshness [30], [29] and completeness [31], [33]. Data privacy 
is considered in CryptDB [16]. Our work addresses the query 
completeness property, using a probabilistic approach based 
on [34]. Querying outsourced databases, especially by a third 
party, may give rise to privacy issues relating to the query 
outputs. This issue is not within the scope of our work, but has 
been studied under the notion of differential privacy [42], [43]. The 
basic technique requires adding noise to the query outputs, 
and has mainly been applied to COUNT queries. The more 
queries there are, the weaker the privacy guarantees 
in such approaches. It thus remains challenging to implement 
complex data mining algorithms within a restricted privacy 
budget [44]. 

Most protocols for secure multi-party computation (first 
proposed by Yao [9] and Goldreich et al. [10]) are highly 
inefficient, especially under malicious adversary models. Our 
work assumes a semi-honest adversary model. Because the 
computation is outsourced, we have to take into account 
both the parties' and the delegates' adversarial behaviors. Kamara 
et al. [21] recently investigated how to outsource multi-party 
computation, but considered only a single delegate. However, 
in practice, different parties may be using different public 
cloud service providers (and some may deploy private or 
hybrid clouds), and hence investigating the multi-cloud setting 
is essential. 

The use of homomorphic encryption for privacy-preserving 
addition has been studied elsewhere [23], [45], [24]. These 
works share the same model, in which an untrusted aggregator 
collects inputs from multiple parties and computes the sum 
without learning their individual values. Each delegate in our 
model can be considered as such an untrusted aggregator, 
which is not only curious but also potentially lazy. Our 
protocol both preserves privacy and ensures correctness of 
the computation. Furthermore, these related works on secure 
sum do not go as far as considering their protocols as parts 
of complex, cloud-based applications such as collaborative 
data mining. In contrast, our effort has been equally on the 
conceptual foundations and on the actual implementation and 
benchmarking of the same. 

Our delegated computation model is a special case of 
verifiable computation, in which a client outsources its 
computations to a more powerful entity and is able to later 
verify the outputs. Theoretical results have shown that any 
computation can be outsourced with guaranteed input and 
output privacy [17]. However, a general protocol for outsourced 
computation is highly inefficient [20]. Some systems propose 
to detect cheating and mis-computation at the expense of data 
privacy [19], [18], but they rely on probabilistic checking and 
require the client to pre-compute the results or the delegate to 
commit certain values. Wang et al. [20] propose a practical 
method to outsource linear programming to the cloud. These 
works consider a single party and delegate, as opposed to our 
model. 

VI. Conclusions and Future work 

In this paper, we have described a service that allows mul- 
tiple parties to take part in a secure multi-party computation 
(sum) in which computation is outsourced to a set of delegates. 
The protocol protects data privacy and ensures correctness 
of the computation against a lazy-and-curious delegate and 
curious party model. We have used the service in designing a 
cloud-based system for carrying out collaborative data mining. 
We discussed techniques for outsourcing databases to the cloud 
in a secure manner, and for checking if the cloud has executed 
queries truthfully. We have chosen three classical data mining 
algorithms representative of some standard tasks: NaiveBayes 
(classification), Apriori (association rule mining) and KMeans 
(clustering) to demonstrate how the secure sum service can be 
used in complex analytics. 

A prototype for the secure sum service and data mining 
applications has been implemented in Java and evaluated in 
a cloud-like environment with real-world datasets. Our 
experimental studies have quantified the service overhead caused 
by cryptographic operations. When used within data mining 
applications, however, the cost of performing database queries 
is orders of magnitude more significant than that of secure 
sum. For the clustering algorithm, our cloud-based system does 
not always yield the exact clusters, due to potential query 
errors caused by the use of order-preserving encryption, but 
they are very close to the outputs of standard KMeans running 
on unencrypted databases. As workloads increase (more 
parties, more complex algorithms, bigger databases), the savings achieved 
by moving to the cloud outweigh the overhead incurred by 
our secure sum service. 
An immediate extension to our work is to conduct the 
experiments on real clouds, in order to quantify the real 
cost and efficiency. The cloud's elasticity will also enable us 
to scale our studies to many more parties and much larger 
datasets. 

We have shown with a proof of concept that it is possible 
to delegate multi-party computations to the cloud in a secure 
manner, and to realize secure, collaborative data mining 
applications. In future work, we would like to consider other 
delegated computations besides sums, such as scalar vector 
multiplication, min/max, etc., which will consequently enable 
more complex data mining applications. Our current adversary 
model for the delegate is still semi-honest; extending it to 
a malicious model poses significant challenges and research 
opportunities. 

There exist other additively homomorphic encryption 
schemes besides Paillier [25], which we intend to study and 
compare in the context of our work. We have not rigorously investigated 
the Setup phase, in which the group of parties is 
formed and agrees on the keys. Dynamic group memberships 
could affect our protocol in interesting ways. Finally, we plan 
to incorporate differential privacy techniques into the query 
phase, and investigate the maximum privacy budget needed 
to realize any data mining algorithm. 